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Personality measurement is based on tine idea tliat values on an unobservable latent 
variable determine the distribution of answers on a manifest response scale. Typically, 
it is assumed in the Item Response Theory (IRT) that latent variables are related to the 
observed responses through continuous normal or logistic functions, determining the 
probability with which one of the ordered response alternatives on a Likert-scale item is 
chosen. Based on an analysis of 1731 self- and other-rated responses on the 240 NEO PI-3 
questionnaire items, it was proposed that a viable alternative is a finite number of latent 
events which are related to manifest responses through a binomial function which has only 
one parameter — the probability with which a given statement is approved. For the majority 
of items, the best fit was obtained with a mixed-binomial distribution, which assumes two 
different subpopulations who endorse items with two different probabilities. It was shown 
that the fit of the binomial IRT model can be improved by assuming that about 10% of 
random noise is contained in the answers and by taking into account response biases 
toward one of the response categories. It was concluded that the binomial response 
model for the measurement of personality traits may be a workable alternative to the 
more habitual normal and logistic IRT models. 

Keywords: personality measurement models, mixed-binomial model, Likert-scale, NEO Personality Inventory, 
self- and observer-ratings, response bias, measurement invariance 



INTRODUCTION 

Nearly all currently popular personality questionnaires use mul- 
tiple items paired with a Likert-type response format in which 
respondents specify their level of agreement or disagreement 
with a large number of statements describing multiple person- 
ality traits. For example, agreement or disagreement can be 
expressed on a 5-point scale with the following response cate- 
gories: 1 = strongly disagree, 2 = disagree; 3 = neutral; 4 = 
agree, and 5 = strongly agree (Costa and McCrae, 1992; Goldberg 
et al, 2006). Rensis Likert, who introduced this type of response 
format (Likert, 1932), was obviously interested in capturing a 
larger range of agreement rather than that of a binary black 
and white picture without intermediate shades of gray. Since all 
items belonging to one scale are assumed to be parallel mini- 
instruments which replicate each other, it is common practice 
to sum item raw scores together over a whole range of simi- 
lar items measuring, for instance, angry hostility or openness to 
new ideas. In so doing, Likert scaling obviously assumes that dis- 
tances between each neighbouring response category are equal. 
The subjective distance from "strongly disagree" to "disagree" is 
expected to be exactly the same as the distance from "neutral" 
to "agree." Otherwise it does not make much sense to compute 
the mean score for angry hostility, openness to new ideas, or 
any other personality traits by aggregating scores across items 
belonging to these subscales. Even if researchers can be more 
or less confident that their respondents understand the order 
between response categories, that is that "neutral" is supposed 



to be stronger than "disagree," there is no guarantee that the 
perceived interval between the response categories actually cor- 
responds to the relation between the numerals which are used 
to label response categories. There is no simple and intuitively 
transparent test which could show whether subjective intervals 
between neighbouring response categories are reasonably simi- 
lar or they are different to such an extent that aggregating them 
together could lead to substantial errors. This may be one of the 
main reasons why test constructors very seldom analyse distri- 
butions of responses of individual items. Usually it is considered 
enough to observe the standard deviation and skew of a response, 
in addition to correlations between items, to decide whether this 
item can or cannot be included in a personality questionnaire. 

Most personality researchers seem to be happy with the idea 
that the observed response scores on a personality item are 
determined by some unobserved latent variable (BoUen, 2002; 
Borsboom, 2005). If a person, for example, endorses a statement 
like "I have sympathy for others less fortunate than me" then she 
or he is supposed to have an enduring tendency to be moved, 
more than average, by others' needs and be inclined to empha- 
size with the human side of social policies. Thus, if there is an 
item which is supposed to measure a trait called A6: Tender- 
Mindedness, then a disposition has to really exist that makes 
people feel sympathy and concern for others (Costa and McCrae, 
1992). The basic idea of the Item Response Theory (IRT) is that a 
mathematical function exists that describes relationship between 
the latent variable, representing the trait, and responses on a given 
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item (Lord, 1980; van der Linden and Hambleton, 1997; Reise 
et al., 2005). This mathematical function is usually called Item 
Response Curve (IRC; also known as Item Characteristic Curve) 
or, when the number of response alternatives is not dichotomous. 
Category Response Curve (CRC) (Reise and Waller, 2009). 

The choice of the mathematical function relating latent vari- 
ables to manifest responses has largely been a matter of conve- 
nience, rather than a deliberate theoretical justification. Usually 
it is just postulated that there is a psychological continuum of 
latent states which is related to response categories through a 
process described sufficiently well by a normal distribution (c£, 
Thurstone, 1927). The normal distribution was chosen to rep- 
resent IRC or CRC mainly due to its own "good properties" 
(completely specified by the mean and standard deviation, sum 
of two normal distributions is again normal distribution etc.), not 
due to any serious factual evidences. As a proof, normal ogive was 
often replaced with a more popular logistic function which was 
computationally less demanding than polynomial approxima- 
tions required for the normal distribution. However, with a trivial 
rescaling using a single constant it is possible to make the two- 
parameter logistic function closely approximate the cumulative 
normal ogive (Camilli, 1994). 

Although there are numerous applications of IRT methods 
to personality data (Chernyshenko et al., 2001; Maij-de Meij 
et al, 2008; Samuel et al, 2010; Spence et al, 2012; Stepp et al, 
2012) there is still an on-going debate whether IRT can be used 
as extensively in personality measurement as it is in cognitive 
measurement (Reise and Widaman, 1999; Chernyshenko et al., 
2001; Reise and Henson, 2003). Researchers very seldom report 
IRT model fit to their personality data at the level of single 
items since a usual Pearson goodness-of-fit statistic (j^) is often 
too restrictive to provide a tolerable agreement between empir- 
ical data points and model's prediction in the strict statistical 
sense (Stone and Zhang, 2003; Swaminathan et al, 2007; LaHuis 
et al., 2011). In order to relieve fitting requirements several non- 
parametric IRT methods were proposed in which only invariant 
item ordering is taken into account (Mokken and Lewis, 1982; 
Scheiblechner, 1999, 2007; Sijtsma and Molenaar, 2002; Meijer 
and Baneke, 2004). In many cases nonparametric IRT models 
seem to be clearly superior of normal or logistic response func- 
tions (Chernyshenko et al., 2001; Sijtsma and Molenaar, 2002; 
Reise and Waller, 2003). 

It is somewhat surprising that latent variables in IRT models 
are almost always conceptualized as an infinite set of events or 
psychological states. Even if to assume that two popular func- 
tions, normal and logistic, do not need to extend from minus 
to plus infinity it is still presumed that a latent variable is iso- 
morphic to a considerable segment cut out from a continuum of 
real numbers. However, continuous response functions are not 
the only option. At least logically nothing prevents us to consider 
latent variables represented by a finite set of events or discrete psy- 
chological states. Each of these events or states can occur with 
a certain probability and combination of these latent events are 
supposed to determine the choice of an explicit response cate- 
gory, for instance choice of one of the response options on a Likert 
scale. Of course, there are many discrete probability functions 
but the binomial distribution comes first to the mind. Indeed, 



any introductory textbook of probability theory starts with an 
example of coin-tossing which outcomes can be described by a 
binomial function. As a matter of fact, it was shown that the 
Rasch model, one of the basic tools of IRT, fits coin-tossing data 
very well (Wood, 1978), but it appears that very few have thought 
about coin-tossing itself, or any other similar process consisting 
of discrete events, as a prototype for a latent variable. 

Next I would like to present an example of a binomial function 
providing a good approximation to how answers are distributed 
on a Likert scale. For instance, in a recent cross-cultural study 
(Mottus et al., 2012a), participants from 21 countries evaluated 
their own Conscientiousness on 6 NEO Personality Inventory 
facets Competence, Order, Dutifulness, Achievement Striving, 
Self- Discipline, and Deliberation using a bipolar rating scale with 
the negative side of the trait described on one end of the scale 
and the positive side on the other (Terracciano et al., 2005). For 
the Competence facet, for example, participants had to rate, on 
a five-point scale, their position between the end points of the 
trait defined as "capable, efficient, competent" and "inept, unpre- 
pared." Figure 1 demonstrates response frequencies pooled across 
these six items and all participants for three different countries: 
Benin (blue circles). South Korea (red squares), and the United 
States (green rhombus). It is clear that, on average, Benin partic- 
ipants rated their conscientiousness higher than U.S. participants 
who, in turn, judged their own conscientiousness higher than 
South Koreans who were one of the lowest among the 21 nations. 

These three distributions of answers are surprisingly similar to 
a binomial distribution b(ri, p^, n) with the probabilities pB = 
0.867, 0.621, and 0.731 respectively. Figure 1 demonstrates three 
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FIGURE 1 I Aggregated response frequencies on 6 Conscientiousness 
items for three countries: Benin (blue circles). South Korea (red 
squares), and the United States (green rhombus) (Mottus et al., 2011). 

Three binomial distributions with values of parameter pe = 0.867, 0.621, 
and 0.735 are shown by blue, red, and green lines. 
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binomial distributions shown by lines corresponding to these val- 
ues which rather closely foUow empirically observed response 
frequencies. It is perhaps even needless to say that approximation 
of the same data with a two-parameter normal CRC provides a 
relatively poor fit to these data. 

The binomial distribution is the discrete probability distribu- 
tion of the number of successes n = 0, 1 , . . . , m in a sequence of n 
independent experiments each with two possible outcomes ("yes" 
or "no"), with a fixed success ratepB across all these experiments: 

r,!(M — r,)! 

In order to separate the binomial probability pB from the sym- 
bol of statistical significance a subscript "B" referring to binomial 
distribution is used. In confirmation of the visual impression, 
a Pearson chi-square (x^) test (the sum of the squared differ- 
ences between the observed and predicted frequencies, divided 
by the predicted frequency) showed that, in all three cases, 
the binomiaUy expected frequencies do not differ significantly 
from the empirically observed response probabilities (p < 0.05). 
Consequently, the empirically observed response frequencies 
seem to be predicted solely by the parameter in the binomial 
distribution h(r[, ps, 4). In the IRT terminology pB corresponds 
to the item "difficulty" although in the personality domain it is 
pointless to describe, for example, extraversion more "difficult" 
than introversion or vice versa. To express the binomial func- 
tion b(ri, pB> 4) in an explicit form, the probabilities /i , /2 , . . ./s 
that one of the five answers on a Likert-scale was chosen can be 
predicted by the following five equations: 
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double subscript corresponding to a single subject and a specific 
item, into the logistic form. In the particular case where n = I 
(responses are scored as either 0 or 1) the logistic transforma- 
tion reduces to a Rasch's simple logistic model (Rasch, 1960). In 
other words, the binomial response model was transformed into 
generalized logistic model and because of it lost its self-sufficient 
meaning. Indeed, if a binomial distribution can be approximated 
and replaced either by logistic and normal functions then there 
is no need for extra assumptions beyond those of the already 
existing and well-known IRT models. This model was applied to 
16 items asking how much children liked school. After eliminat- 
ing 18 out of 309 fifth-year schoolchildren and 3 poorest fitting 
items a good agreement between the model and data was achieved 
(Andrich, 1978). 

In this paper, however, I entertain an idea that a binomial 
process itself could be an adequate manifestation of a latent vari- 
able which in consequence determines distribution of answers on 
a Likert scale. One possibility for conceptualizing this binomial 
process is to imagine that respondents, before giving their answer 
to an item, execute a latent binomial mini-experiment. There is 
no need that the subject is aware of this mini-experiment or she 
or he is able to describe it introspectively. This hypothetical mini- 
experiment consists of four (one less than the number of response 
alternatives) mini-trials, each of which requires a simple indica- 
tion whether or not the respondent agrees with the content of 
the posed question. The number of the successful items in this 
mini-experiment determines the answer chosen on the Likert- 
scale item. This binomial mini-experiment needs to satisfy the 
following four conditions: 

( 1 ) There are four hidden mini-trials which together produce five 
possible outcomes; 

(2) Each mini-trial is independent of the others; 

(3) There are only two outcomes — "yes" or "no" — in each mini- 
trial, expressing endorsement of the core idea articulated in 
the item; 

(4) The probability of each outcome remains constant through- 
out all four mini-trials. 



If responses are close to these five predicted probabilities, as it is 
in Figure 1, then it is likely that these responses are related to a 
latent variable by a binomial process. It is relevant, however, to 
recognize that binomial distribution having only one parameter 
pB is more restrictive than normal distribution determined by its 
mean and standard deviation. For example, on a 5-point Likert 
scale it is impossible that the middle response category R3 can be 
chosen with a probability higher than 0.375. Thus, any empiri- 
cal distribution of answers in which the midpoint of the scale is 
chosen with higher frequency than 37.5% is incompatible with 
binomial distribution. Of course, there is no such restriction for 
normal distribution. 

Up to now this narrative replicates almost exactly an excellent 
paper written by David Andrich who introduced a binomial latent 
trait model for the study of Likert-type items (Andrich, 1978). 
However, from this point onward the current exposition deviates 
from what was proposed by Andrich (1978). As a next step, he 
demonstrated how to transform binomial probability pB, with a 



When these four conditions are fulfilled, the degree of endorse- 
ment of an item is described by the binomial function. Of course, 
the metaphor of a binomial mini-experiment is only one way to 
conceptualize this latent decision process. There are many other 
possibilities to describe the judgments that respondents make 
about their own or somebody else's personality traits when con- 
fronting a Likert-type scale. What is essential is for the binomial 
distribution response categories that were chosen to have truly 
numerical equivalents. For example, the choice response category 
Rl ("strongly disagree") corresponds to 4 "mini-disagrees," R2 
("disagree") corresponds to 3 "mini-disagrees" and one "mini- 
agree" until R5 ("strongly agree"), which is the result of 4 "mini- 
agrees." Since outcomes can be expressed in enumeration, this not 
only entails that the response categories are ordered Rl < R2 < 
R3 < R4 < R5, but that the intervals between two successive cate- 
gories remain numerically the same across the whole Likert scale. 
Indeed, if the response distribution on a Likert scale item sat- 
isfies a binomial distribution, it is justified to operate with the 
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responses as natural numbers. Since two and four are two and 
four mini-endorsements, respectively, it is absolutely justified to 
say that, in sum, there were 6 mini-endorsements or exactly 3 
mini-endorsements per item. In short, the respondents seem to 
operate with the response categories as if they were natural num- 
bers that can be counted, provided that responses obey binomial 
distribution. 

Of course, not all empirically observed response distribu- 
tions on a Likert-type scale obey a binomial distribution. From a 
statistical point of view, it is highly unlikely that all response prob- 
abilities are strictly related to one another by a set of restrictive 
formulas depending on only one free parameter — the probability 
of endorsing a latent statement. As a confirmation of this observa- 
tion, many actual response distributions in the above mentioned 
study depart considerably from the binomial law (Mottus et al., 
2012a). However, to the best of my knowledge, in very few studies 
binomial function was used as a response function which con- 
nects latent states to explicit responses on a Likert scale item (cf. 
Andrich, 1978). In some other applications binomial function 
was used to describe sampling or some other processes that are 
relevant for the IRT (Lord and Novick, 1968; van der Linden, 
1979; Gross and Shulman, 1980). 

One of the main goals of this paper is to develop a binomial 
IRT model which can be used as a benchmark relative to which 
better known two-parameter normal or logistic IRT models can 
be evaluated. In this study I am not intending to study all 30 sub- 
scales of the NEO Personality Inventory each of which consists of 
eight items. More modest goal is to observe how the distribution 
of answers on each 240 Likert-type items can be fitted in isola- 
tion with a binomial function. It is very unlikely that one specific 
IRT model is superior of others in majority of conditions. More 
realistic is to elucidate not very well understood characteristics of 
data which make them better or worse harmonizing with different 
response functions. 

Beside this central goal there are some other problems to 
solve one of which is related to a well-known fact that Likert 
scales are vulnerable to several distortions. Respondents may 
have, for instance, a tendency to use extreme categories (extreme 
response style) more often than could be expected from purely 
statistical considerations (McGrath et al., 2010; Mottus et al., 
2012b). Likewise, some other group of respondents may have 
an inclination to use a middle category ("neutral," "not sure," 
"undecided") more frequently than could normally be expected 
(Dubois and Burns, 1976; Hernandez et al., 2004; Kulas and 
Stachowski, 2009). Although operationally it is easy to define 
the extreme response style as a frequency of using extreme 
response categories ("strongly agree" and "strongly disagree"), 
it is tremendously difficult to separate the "normal" use of 
any of these response categories from their disproportionate or 
biased use. In other words, my proposal is to use a binomial 
response model as a reference relative to which biases can be 
consistently defined. Deviations from the expected theoretical 
distribution are natural candidates in the search for response 
biases. 

One reason why not all empirically observed response dis- 
tributions on a Likert-type scale foUow a binomial distribution 
may be that not all respondents understand and answer items 



identically. Constructors of personality questionnaires mostly 
take for granted that all respondents use the response categories 
to an item in basically the same way. In many cases, this is noth- 
ing but good faith because until recently there were no versatile 
tools to identify different subgroups of respondents who inter- 
pret response categories differently. Fortunately, several mixed- 
measurement or latent class models were developed to identify 
two or more subgroups of respondents who interpret response 
categories in a different way (Rost, 1990, 1991; Bolt et al., 2001; 
Asparouhov and Muthen, 2008; Maij-de Meij et al, 2008; Muthen 
and Asparouhov, 2009). A bad fit to a binomial response model 
may be an indicator of at least two groups whose response pat- 
terns can be described by two different binomial response models. 
One of the main goals of this paper is to test how much a simple 
binomial response model can be amended by a mixture-binomial 
model which assumes two different subpopulations that have 
two different probabilities pB with which they endorse a given 
item. 

METHODS 

Participants came from the Estonian Genome Center (EGG) bio 
bank, which is currently affiliated with the University of Tartu 
(for details see www.biobank.ee). The EGC participants have 
been randomly selected from individuals visiting general prac- 
titioners (GP) and hospitals and recruited by GPs and hospital 
physicians. All participants have provided informed consent. In 
addition to donating blood samples and answering a medical 
questionnaire, participants were asked to complete the self-report 
Estonian version of the NEO Personality Inventory-3 (NEO PI- 
3; McCrae et al., 2005) and also find a knowledgeable infor- 
mant who would complete the same questionnaire about them. 
The NEO PI-3 is a sHghtly modified version of the NEO PI- 
R questionnaire (Costa and McCrae, 1992) that was adapted 
into Estonian by Kallasmaa et al. (2000). Like its predecessor 
the NEO PI-R, the NEO PI-3 has 240 items which measure 
30 personality traits grouped into the five personality domains: 
Neuroticism (N), Extraversion (E), Openness (O), Agreeableness 
(A), and Conscientiousness (C). Each domain is represented by 
6 facets. For example, Neuroticism is measured by Nl: Anxiety, 
N2: Angry HostiUty, N3: Depression, N4: Self-Consciousness, N5: 
Impulsiveness, and N6: Vulnerability. Each facet is measured by 8 
items, approximately a half of which are keyed in the opposite 
direction. Participants were asked to express their agreement or 
disagreement with the content of each item on a 5-point scale 
with the following response categories: 1 = strongly disagree, 2 = 
disagree; 3 = neutral; 4 = agree, and 5 = strongly agree. Some of 
these personality data have been used in other papers (e.g., Allik 
et al., 2010b; de Moor et al, 2012; Mottus et al, 2012c,d). 

In total, the sample used in the present study included 1731 
people (of whom 991, or 57.3%, were women) with a mean age of 
42.8 years (SD = 16.5, ranging from 18 to 89 years). Informants 
were slightly younger than their targets with a mean age 39.9 
years {SD = 15.5, ranging from 12 to 89). Of the informants, 52% 
were spouses or partners, 15% friends, 12% parents, 9% children 
or grandchildren, 6% siblings, 3% acquaintances, and 3% were 
categorized as other relatives. Women comprised 68.9% of the 
informants. 
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RESULTS 

COMPARISON OF ONE-PARAMETER BINOMIAL (1PBM) AND 
TWO-PARAMETER NORMAL (2PNM) MODELS 

To begin with, I searched for the binomial function b(r, p^, 4) 
which could best predict response frequencies on all 240 NEO 
PI-3 items for both self- and observer-ratings. In terms of the 
IRT classification, this is one-parameter binomial model (IPBM) 
applied to each item in isolation. The program looked through 
all possible pB -values and selected the one which provided the 
best fit to the observed response frequencies in terms of the min- 
imal -value. Since -statistics depend directly on the number 
of observations N = 1731, it was unrealistic to expect a gener- 
ally good statistical fit. Indeed, only one other-rated item !239 
"He/she would rather be known as 'merciful' than as 'just'" had 
a response distribution sufficiently close to a binomial function 
at the level of 5% significance: x^4) = 5.668, p = 0.225. Figure 2 
demonstrates the distribution of observed and predicted response 
frequencies for the item !239. All other items demonstrated poor 
fit to a binomial distribution, in terms of the x^ -statistics, at least. 

It is important to stress that there was no substantial loss of 
information in computing the pB -value for each item, compared 
to a more customary arithmetic mean of all responses. The mean 
scores of all 240 NEO PI-3 items were extremely highly correlated 
with the computed binomial probabilities p-g, identically for self- 
and other-ratings (r = 0.998, p < 0.000001). 

Next, for a comparison, I fitted two-parameter normal model 
(2PNM) to the same set of data. In spite of an additional parame- 
ter 2PMN approximation was not palpably superior of the IPBM 
approximation. Out of 480 response patterns not a single x^- 
value was below 9.4877 (5% significance level for four degrees of 
freedom). Also both the Kolmogorov-Smirnov and Shapiro-Wilk 
tests rejected the hypothesis that the observed distributions of 



responses were normal in each and every case by a considerable 
margin. 

In order to compare fits of these two models, IPBM and 
2PNM, x^ -values for self- and other-ratings were arranged into 
increasing order. Figure 3 depicts the pattern of x^ growth for 
these two models for both self- and other-ratings. 

Generally speaking, there is no substantial difference between 
one-parameter binomial and two-parameter normal approxima- 
tion. It seems, however, that other ratings are slightly better 
approximated by both binomial and normal models. The median 
X^-values were 173.9 and 215.6 for 2PNM and IPBM respec- 
tively for self-ratings. For other ratings these x^-values were 
slightly lower: 160.2 and 168.2 respectively. It is important to 
mention that using some other goodness of fit criteria that 
could account for differences in the number of parameters (e.g., 
Akaike Information Criterion which in the current case is AIC = 
X^ -|- 2k, where k is the number of parameters) was not able to 
demonstrate indisputable advantage of either IPBM or 2PNM 
models. 

ESTIMATION OF POSSIBLE NOISE 

Because the chi-square statistic is very stringent, especially at a 
large number of observations, more realistic measures should be 
considered. One promising idea is to determine how resistant 
the binomial fit is to the addition of random noise. The basic 
idea is to add a different percentage of random noise to a per- 
fect binomial distribution for establishing how much noise can 
be tolerated before binomial approximation starts to deteriorate. 
If all JV = 1731 responses on a single item are distributed exactly 
according to the binomial function b(ri, p^, 4), then the level of 
noise is evidently zero and there will be no approximation error. 
Let suppose that not all 100% of responses are distributed bino- 
mially, but only a fraction of them. The remaining percentage of 
answers K will be distributed randomly, so that the occurrence 
of each response alternative J?; has an equal probability. Thus the 
ratio 100 • K/N determines the percentage of noise. By increas- 
ing the percentage of noise we can observe how the average misfit 
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increases with the added noise. In order to make such a simulation 
maximally realistic, I selected 240 previously obtained binomial 
pB-values which were used to approximate self-ratings with a 
binomial function (the meanpB-value was 0.56 with SD = 0.13). 
Figure 4 demonstrates how the binomial fit tolerates addition of 
different amounts of random noise. The abscissa is the proba- 
bility that the obtained -statistic exceeds the 5% significance 
level, which, for four degrees of freedom, is = 9.4877. This 
figure tells us that slightly more than 5% noise makes already 
half of the NEO PI-R 240 items misfit to a binomial distribu- 
tion, even if 95% of answers were given according to a perfect 
binomial law. If there is 12% of noise, then it would be totally 
unrealistic to expect that any of the 240 items has a statistically 
significant agreement with a theoretical prediction based on the 
binomial function. It is important to remember that if we divide 
this 12% noise equally between five possible response categories 
then already 2.4% deviation from the expected frequency would 
be an intolerable imprecision. It is perhaps worth mentioning that 
typical estimates of measurement error in personality research 
based on Likert-type scales are usually estimated higher than 12% 
(Viswesvaran and Ones, 2000; McCrae et al, 2001, 2011). On 
the whole, this hypothetical level of noise can be used as a more 
realistic estimate of the approximation quality of the binomial 
function. 

Strictly speaking this modeling result means that by adding 
one additional parameter — random noise — to IPBM it is possible 
to make a binomial model entirely fit with the observed response 
patterns. What is particularly noteworthy it is sufficient that only 
in a limited number of trials the attention of respondents is dis- 
tracted and they respond erratically to a relatively small number 
of items. 




0 1 2 3 4 5 6 7 8 9 10 11 12 13 
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FIGURE 4 I Proportion of items predicted to misfit at the 5% 
significance level on /^-statistics as a function of random noise 
percentage. 



MIXED BINOMIAL MODEL (3PBM) 

There were few items which response patterns could be fit- 
ted with neither binomial nor normal function in principle. 
These response patterns typically have two maximums and 
the minimum in the middle of the response scale (V-shape). 
The worst prediction was for the self-rated item !ll2 "I tend 
to avoid movies that are shocking or scary," which measures 
E5: Excitement Seeking. Figure 5 demonstrates the observed 
response probabilities (blue) and the best fitting binomial pre- 
diction (red). Please ignore the green lines and symbols for the 
time being. Obviously, the IPBM cannot provide a reasonably 
good fit to an empirical distribution which has such a distinc- 
tive V-shape: X(4) = 2654.1, p < 0.000001. As expected, these 
two more or less polarized groups who either like or dislike 
scary movies are distinctive by demographic variables. These 
were dominantly women and participants with higher education 
who disliked horror movies. Among those who enjoyed shock- 
ing experiences were a disproportionally large number of men 
who had only elementary education. It cannot be excluded that 
some of participants ignored the negative form "don't like" in 
Estonian ("ei meeldi") and answered as if it were an affirmative 
item. 

It is likely that different groups of participants use and inter- 
pret Likert items in different ways. A characteristic V-shape may 
be suggestion to use a mixed binomial model instead of IPBM 
which seems to be unable to explain this type of response pat- 
tern. Technically, it means that different groups rely on different 
binomial processes, which are governed by two different binomial 
parameters, ps and ^b- While partitioning normally distributed 
variables into different subgroups is not an easy task (e.g., Lahti 
et al., 2002), it is certainly less problematic with a binomial 
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FIGURE 5 I Observed (blue) and predicted (red) response probabilities 
for the self-rated item i112 (E5: Excitement-Seeking). The green shows 
a mixed-binomial approximation with two differently weighted binomial 
components pg and qg. 
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distribution which depends only on one parameter rather than 
two, as is the case with a normal distribution. 

A program was written to search for the best fit for the 
following mixed-binomial functions: 

fi = w- biri,pB, 4) -I- (1 - w) • bin, <jb, 4), (7) 

where /i is the expected response probability of the i-th response 
alternative (i = 0, 1, . . . , 4), pB and are two different bino- 
mial probabilities, and w is the weight with which the first of the 
two binomial components is taken. Since there were 3 parame- 
ters in the model we can call it three-parameter binomial model 
(3PBM). The best fit was searched for with an exhaustive method 
among all possible combinations ofpB, '?B> and w. 

A significant improvement can be defined as a reduction in 
the -statistic for a mixed-binomial model or 3PBM of at least 
5%, compared with a single binomial model fit. According to this 
criterion, there were 25 (10.4%) self-rated and 31 (12.9%) other- 
rated items which failed to show any improvement by applying a 
mixed-binomial model. The ranked distribution of -values for 
self-rated and other-rated items is shown in Figure 3. The good- 
ness of fit has considerably improved relative to both IPBM and 
2PNM. As the quality of the approximation improved, 18 (7.5%) 
self-rated and 36 (15%) other-rated items achieved agreement 
between the observed and predicted frequencies at a 5% signif- 
icance level. On average, the -statistics improved by 52% for 
both self- and other-ratings. In spite of this substantial improve- 
ment, the median x^ -statistics remained clearly above the critical 
5% significance level: 81.7 and 62.4 for self- and other-ratings, 
respectively. This improvement is not surprising considering the 
fact that a quite large number of response distributions demon- 
strated V-pattern at least partially. For example, the number of 
response patterns for which the response frequency to the middle 
category R3 was smaller than to the neighbouring response cate- 
gories R2 and R4 simultaneously was unexpectedly high: 45 and 
34% out of all 240 items had a V-shape in the middle of the scale 
for the self- and other-ratings respectively. Thus, there was per- 
sistent tendency to use a middle category ("neutral" or R3) less 
frequently than could be normally expected. Since neither bino- 
mial nor normal distributions can have a local minimum in the 
middle of the scale it is not surprising that simple response models 
IPBM and 2PNM did not have a good fit to data. 

Unfortunately, there is no possibility to compare 3PBM per- 
formance with a mixed normal model. In addition to two nor- 
mal distributions with their means and standard deviations it 
is also necessary to specify weight w with which the first of the 
two normal components is taken. In the result the number of 
model's parameters equals to the number of response categories 
which makes approximation meaningless: the number of model's 
parameters is equal to the number of observed variables. 

It is interesting to notice that the computed x^ mean val- 
ues for self- and other- ratings correspond to about 13 and 
11% noise levels, respectively, which were determined by the 
above described simulation procedure. These two estimates are 
slightly higher than that which could be deduced from Figure 3. 
Considering the number of items which passed or failed to pass 
the x^ significance test at the 5% level, the estimates of the noise 



percentages are closer to 8 and 9% for other- and self-ratings, 
respectively. 

There is no space to reproduce approximation statistics for all 
240 NEO PI-3 self- and other-rated items. For reasons of econ- 
omy, only the 20 best and worst items for both self-ratings and 
other-ratings are reproduced in Table 1. 

Many items appear in both sides of these lists of self- and 
other-ratings. This is not surprising because the mean values of 
self-ratings are highly correlated with the mean values of other- 
ratings over all the 240 NEO PI-R items (r = 0.90, p < 0.00001). 
The correlation between the x^-statistics (r = 0.62, p < 0.00001) 
also indicates that many items which are either good or bad in 
terms of a mixed-binomial approximation are sufficiently similar 
for self- and other-ratings. However, it is also evident that other- 
ratings obey a binomial distribution slightly more precisely than 
self-ratings. 

How well an item was approximated by a mixed-binomial 
model also depended on which personality dimension it was 
supposed to measure. Figure 6 demonstrates the mean x^ 
approximation of errors for items belonging to one of the five 
personality dimensions. Items which were created to measure 
Openness demonstrated the best fit with the mixed-binomial dis- 
tribution. In both of the categories of ratings, self-ratings and 
observer-ratings, the most problematic were the Neuroticism and 
Conscientiousness items, which exhibited on average the worst 
fit to a mixed binomial distribution. Interestingly, when I com- 
puted the mean number of potentially biased items, the general 
shape replicated the pattern shown in Figure 6. A biased item was 
defined as one which had at least one response category on which 
theoretically expected and empirically observed frequencies dif- 
fered by at least 10% (there were 56 self-rated and 41 other-rated 
items which were biased). 

One possibility is that the x^ approximation error is some- 
how related to the content of the items. For example, items with 
higher or lower social desirability content may demonstrate a 
different fit to the theoretical model. In order to study this pos- 
sibility, I computed the mean x^ approximation error for each 
of the 30 NEO PI-R subscales and compared these mean values 
with previously reported social desirability ratings for the same 
subscales (Konstabel et al., 2006). Approximation errors did not 
correlate significantly with social desirability ratings (r = —0.09, 
p = 0.623 and r = —0.17, p = 0.363 for self- and other-ratings, 
respectively). No evidenced was found that the approximation 
error is related to the visibility of personality traits (cf. Allik et al., 
2010b) or from whose perspective they are judged (Allik et al., 
2010a). 

One telling example for the dramatic improvement after a 
mixed-binomial approximation is the above mentioned item il 12 
(Figure 6). The worst item in a single binomial prediction turned 
into an exemplary item in a mixed-binomial prediction: x^4) = 
1.255, p = 0.868. It is very likely that a characteristic V-shape is 
determined by two disparate binomial functions with the param- 
eters pB = 0.08 and = 0.80, respectively. Since the weight of 
the first function w = 0.66 is about two times larger than the 
weight of the second function (1 — w) = 0.34, the subgroup of 
those who tended to disagree was about twice more populous 
than the subgroup whose members were inclined to agree with 
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Table 1 | Twenty best and worst NEO PIS self- and other-rated items in terms of the x^-statistlc for a mixed-binomial approximation. 
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Rnk 


Item 


Subscale 


Pb 




w 




lmpr% 


Item 


Subscale 


Pb 


<7b 


1/1/ 




lmpr% 




/23 


05 
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0.97 
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E5 
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0.66 
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100.0 
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0.22 
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0.69 
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99.6 
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0.60 
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0.36 
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/1 30 


02 


0.87 


0.34 


0.93 
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0.56 


188.36 


38.8 


222 


/1 15 


C5 
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/1 21 
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lmp%, the percentage of improvement of a mixed-binomial approximation (3PBM) relative to a single binomial approximation (1PBM): * Items which appear in both 
parts of the list in self- and other-ratings. Missing qs and w values mean that a mixed-binomial function improved the approximation less than 5%. 



the statement. Fascinatingly, a mixed-binomial approximation 
of item ill2 was also successful for other-ratings [X(4) = 4.834, 
p = 0.695], with parameters quite similar to those of self-ratings: 
pB = 0.14, qB = 0.73, and w = 0.55. 

DETECTING RESPONSE BIASES 

However, the majority of items failed to reach a significant 
fit, even after applying a mixed-binomial model. Besides the 



constraints of the distribution themselves, the binomial dis- 
tribution, as was already mentioned above, is a very tight distri- 
bution. For example, it excludes the possibility that the middle 
neutral category R3 could be chosen with a higher probability 
than 0.375. This limitation is clearly seen in the item !l34 "I'm not 
known for my generosity" (reversed), which is shown in Figure 5. 
Since a mixed-binomial prediction failed to improve upon a 
single binomial prediction, only the latter is demonstrated. The 
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g I'm not known for my generosity (A3: Altrusim, 11 34, reversed) 

Teiste arvates pole ma suuremeelne Inimene. 

160 I ■ ' ' • ' ] 0.6 I • • . • 




Personality dimensions '^Sree Disagree 



FIGURE 6 I (A) The mean approximation error for items belonging to one of the five personality dimensions: N, Neuroticism; E, Extraversion; 0, Openness, 
A, Agreeableness, and C, Conscientiousness. (B) Observed (blue) and predicted (red) response probabilities for the self-rated item /134 (A3: Altruism). 



main reason for the extremely bad approximation [^^4) = 353.8, 
p < 0.0001] is the much higher than expected exploitation of the 
neutral category. There was no chance that a binomial function 
could have explained the 56.4% with which the neutral response 
was chosen. 

This example demonstrates that if there is a bias toward one 
response category, where this is preferred more or less frequently 
than is expected by a binomial distribution, then it is unlikely that 
even a mixed-binomial model can provide a reasonably good fit to 
the observed response frequencies. On the other hand, deviations 
from the expected binomial distribution can be used as a crite- 
rion for a more precise definition of biases. Thus, bias is not every 
choice of response category, but only when such a choice has a 
sample level probability that is considerably higher or lower com- 
pared to theoretical expectation. Usually, researchers talk mainly 
about the frequencies with which the middle (neutral) or extreme 
response categories are used. However, not only the middle or 
extreme response alternatives can demonstrate bias. There are 
several items which apparently demonstrate a new type of bias 
which could be called a "weak agree" or "weak disagree" category 
bias. One of these items is shown in Figure 7. Like the previous 
example, this item, "I keep myself informed and usually make 
intelligent decisions" {i65), failed to improve after the mixed- 
binomial approximation was applied. Again, there was no chance 
that a binomial function can even reach close to the 62% of cases 
with which the answer "Agree" was chosen (a theoretical maxi- 
mum is only 42.2%). For this reason, item !65 demonstrated the 
second worst fit: ^(4) = 186.8, p < 0.0001. It seems as if a con- 
siderable number of respondents were too shy to admit that they 
are well informed and make intelligent decisions. It seems that, 
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FIGURE 7 1 Observed (blue) and predicted (red) response probabilities 


of the self-rated item f65 {C1: Competence). 



for them, it was socially more desirable to support this statement 
moderately, not strongly, which could violate a socially desirable 
level of modesty. 

There is no chance that strongly biased items can be tolera- 
bly approximated by either binomial or any continuous response 
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functions. Also, there seems to be no simple rule which could 
describe why respondents start to use one of the response cat- 
egories more or less frequently than it could be customarily 
expected on the basis of some model. This means that the best 
but not necessarily elegant solution would be an introduction of 
a certain number of ad hoc variables which would cover large 
deviations from the expected response frequencies. 

DISCUSSION 

It was remarkable that binomial distribution with a single param- 
eter provided generally not significantly worse fit compared with 
two-parameter normal response curve. Although by formal fit 
values it was indecisive which of these two models is supe- 
rior the single-parameter binomial model has several conceptual 
advantages: 

(1) The one-parameter binomial model, IPBM, has only single 
parameter to be determined which is a considerable advan- 
tage before at least two parameters of the normal or logistic 
models; 

(2) Binomial is a discrete probability distribution assuming a 
limited number of latent events which is in a sharp contrast 
with normal and logistic models both of which are presuming 
an infinite continuum of latent states; 

(3) Although there has been a tremendous progress in the com- 
puting power which changes our perspective on the compu- 
tational difficulty it is still much easier to determine values of 
a binomial function than to compute normal or even logistic 
functions; 

(4) The only parameter of binomial function probability pB has 
more transparent interpretation than the "difficulty" param- 
eter in normal and logistic models'. Unlike other IRT param- 
eters binomial probability is dimensionless and completely 
independent of the number of response alternatives on the 
Likert scale. 

The conception of latent variable in the form of a real numbered 
continuum seems so simple and intuitively irresistible that no one 
ever really contested it. Even though some thoughts occasion- 
ally went on more simple discrete probability distributions they 
were immediately brought back to normal or logistic functions 
(Andrich, 1978). In fact, nothing but habits and preconceptions 
prevents us from assuming that traits are represented not by an 
infinite set of events but rather by a finite set of psychological 
states. Of course, the idea of a latent performance of a series 
of mini-trials may seem initially extravagant. However, before 
reaching a premature verdict it is relevant to remember that the 
idea of latent mini-trials is nothing but a metaphor which pri- 
mary function is to emphasize basic properties of an underlying 
process. There is nothing weird, for instance, in the assumption 
that personality traits such as feeling sympathy and concern for 
others — Tender-Mindedness — are represented by a set of seman- 
tic primes which are silently tested when an answer is searched for 
a given personality item (cf, Wierzbicka, 1996). 



One can imagine how confusing it is to explain what the item's "difficulty" is 
when a typical personahty item has a wording "I'm easy-going." 



Without doubt, the strength of IRT consists not only in select- 
ing a response curve to fit distribution of responses on an isolated 
item. A real power stems from a simultaneous analysis of multiple 
items related to a hypothetical underlying latent variable (van der 
Linden and Hambleton, 1997; Reise and Waller, 2009). However, 
the choice of the particular response function connecting latent 
trait to manifest responses has obvious consequences for some 
other inferences made on the basis of the IRT which could lead 
to misleading results (Meijer and Baneke, 2004). For example, it 
is interesting to discuss what would happen to the problem of 
measurement invariance provided that binomial response func- 
tion proved to be a correct one in a given situation. Measurement 
invariance is a property of measurement that indicates that the 
same concept is measured across different groups. For example, 
before comparing scores collected in two different cultures, it 
is useful to verify that a given measure is interpreted in a con- 
ceptually similar manner by respondents representing these two 
cultures (e.g., Nye et al, 2008). There are different levels of mea- 
surement invariance (e.g., configural, weak, and strong) which 
specify the type of comparison permissible between these mea- 
sures (Vandenberg and Lance, 2000). In principle, we cannot 
even compare self-ratings with observer-ratings before measure- 
ment invariance is established on one of these possible levels 
between these two measures. It would be a relatively easy call 
if, for example, self-ratings were represented by two sufficiently 
large subpopulations but other-ratings needed only one sample 
to approximate all data. Quite obviously, in this case, self- and 
other-rated data are not fully comparable because two com- 
pletely different functions are needed to approximate responses. 
The decision becomes much more uncertain when the estimated 
number of subpopulations is identical. There seems to be no sim- 
ple rule how to decide if two miKed-binomial distributions are 
identical. Since a binomial distribution has only one parame- 
ter, it is not entirely clear what to do with weak (metric) and 
strong (scalar) invariance, which are based on the understand- 
ing that a latent variable is related to manifest responses through 
an equation containing two parameters: the regression coefficient 
and residual constant. Without doubt, the binomial measurement 
model provides not only promise but also questions for future 
studies of measurement invariance. 

There is a typical way how researchers react to a situation in 
which a proposed theoretical model does not fit empirical data 
very well (cf , Goldstein and Blinkhorn, 1982). For the start they 
usually recommend eliminating a considerable number of deviant 
respondents. Then they suggest excluding few items which may 
look abnormal and finally they may even advise to dismiss certain 
response categories (such as "neutral") which are inflating the fit 
characteristics (cf , Andrich, 1978). Alternative and more produc- 
tive strategy is to modify response models is such way that they 
could explain at least some of the observed anomalies. Results 
of this study made it absolutely clear that the binomial response 
curve has chances to describe accurately answers on a Likert scale 
only with one of two amendments based on diametrically oppo- 
site principles. The first of these two principles is random or 
erratic responding. As a simulation study demonstrated even a 
small amount of random answers could distort perfectly bino- 
mial response pattern to such an extent that according to a very 
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Strict x.^ criterion none of the items fitted to a theoretical predic- 
tion. Thus, for a realistic binomial response model it is necessary 
to supplement it with one more parameter which corresponds 
to a certain amount of random noise added to the answers. 
The second principle is systematic bias which expresses itself in 
the selective preferences of one response category over all oth- 
ers. It is impossible to predict distribution of responses on these 
quite numerous items with binomial or any other continuous 
cumulative function since some response frequencies are out of 
theoretically permissible limits. Only viable option how to explain 
the response pattern on these items is to propose that in addition 
to a general probability pB to endorse this statement there is an 
inclination to choose or avoid one response category which could 
happen in a certain number of cases. The introduction of an ad 
hoc variable is, no doubt, an inelegant solution. But a very costly 
alternative is to replace biased items in all popular personality 
questionnaires with new items that agree better with a theoretical 
response curve. 

Like numerous previous studies of various personality ques- 
tionnaires (Rost, 1990, 1991; Boh et al, 2001; Asparouhov 
and Muthen, 2008; Maij-de Meij et al, 2008; Muthen and 
Asparouhov, 2009) this study demonstrated that one of the most 
sophisticated and widely used personality instruments NEO-PI 
contains a very large number of items which are understood dif- 
ferently by various groups of participants. However, the fact that 
87-90% of all items demonstrated improvement when IPBM was 
replaced with 3PBM does not automatically speak in favor of two 
sufficiently large groups of participants who understand items 
differently. However, there is little reservation that items with two 
prominent response peaks (see Figures) are understood differ- 
ently by two distinct subgroups of respondents: one group of 
participants preferred one end of the response scale while others 
selected answers on the opposite end of the scale. It is less likely 
that items with a more uniform response pattern (all response 
alternatives are chosen more uniformly) indicate necessarily for 
two or more different subsamples of respondents. Two equally 
weighted binomial distributions with two sufficiently separated 
probabilities could be just a technical trick with which to approx- 
imate a distribution which resists approximation by a single 
binomial function. Unexpectedly, a substantial number of items 
had V-type response pattern distribution and it is unrealistic to 
assume that one of the most elaborated personality measurement 
instruments available NEO-PI contains 30-40% items which are 
not understood identically enough by the participants. It is more 
likely that a characteristic V-shape speaks on numerous occasions 
about another form of response bias — a pervasive tendency to use 
a middle category ("neutral" or R3) less frequently than it could 
be normally expected. Nobody seems to know whether the under- 
use of the middle response category is a distinctive feature of 
personality questionnaires or it is an attribute of odd-numbered 
Likert response scales in general (cf. Kulas et al., 2008; Kulas and 
Stachowski, 2009). 

It is largely acknowledged that application of the parametric 
IRT models to personality data is still problematic (Reise and 
Waller, 1993; Reise and Henson, 2003). There is evidence that 
individuals can respond differently to personality items than to 
ability test items, making IRT application to personality data even 



more challenging (Chernyshenko et al, 2001). Almost all popular 
and widely used personality questionnaires contain items which 
deviate from the common response pattern and are therefore rec- 
ommended to be removed (Spence et al, 2012). It is not rare that 
scales defined as one-dimensional are in fact related to more than 
one latent trait (Reise and Waller, 2003). The parametric models, 
particularly normal and logistic, do not approximate personality 
and psychopathology scales very well sometimes producing mis- 
leading results (Meijer and Baneke, 2004). The basic IRT assump- 
tion that some personality items represent more extreme expres- 
sions of a trait than others (the so-called item "difficulty"), and 
that this pattern of endorsement needs to be identical across indi- 
viduals, is a premise that has yet to be proven (Reise and Waller, 
2003). Many personality items and even scales may be qualita- 
tively different for different groups of persons suggesting that 
unidimensional IRT models are not appropriate for their analysis 
(Egberink et al., 2010). All these evidences may create a mislead- 
ing impression that the way how personality is typically measured 
by NEO-PI or other omnibus personality questionnaires is funda- 
mentally flawed. More realistic, however, is a viewpoint that the 
currently used IRT models need to be revised in order to describe 
adequately how persons answer personality items including Likert 
scale questions. One of these needed revisions may be a systematic 
comparison of alternative response models with the habitual nor- 
mal or logistic IRT models. Of course, it is unrealistic to expect 
that researchers wUl start immediately to use binomial model 
when it turns out that the binomial model provides more pre- 
cise approximation to the observed distribution of answers. The 
advantage of universal and dimensionless probability values ps 
may be relatively small compared with the default options imple- 
mented in all available IRT statistical packages. However, those 
who intend to devise new instruments or revise existing ones 
may find ideas presented in this paper stimulating. First of all, 
it could serve as a tool for identifying potential items which are 
seriously biased. Above I proposed a simple rule how to identify 
biased items. Shortly repeating the principle, any item can be sus- 
pected of being biased if one of its response categories attracts 
substantially, say 10%, more or less answers than could be the- 
oretically predicted. Another intriguing twist would be to use 
differences between observed and predicted response frequencies 
as the weight with which indices of response bias can be com- 
puted. For example, the multiplication of answers with social 
desirability ratings is a convenient way to construct an index 
of social desirability (Konstabel et al, 2006). Analogously, the 
disparity between observed and predicted response probabilities 
could serve as a method for computing different response biases, 
such as the extreme or middle category response biases. 

Applying the binomial latent variable approach allowed us to 
look at personality measures from a slightly new angle. Although 
self- and observer-ratings are very similar and agreement between 
them is substantial (Allik et al, 2010a,b,c), there are still small but 
cross-culturally replicable differences between these two perspec- 
tives (Allik et al, 2010a). The presented data demonstrated that 
other-ratings are systematically more regular than self-ratings in 
terms of their binomial fit. This also means that, in the other- 
report format, there were fewer items which were strongly biased. 
Does it mean that, in judging someone's personality, judges are 
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less vulnerable to social desirability bias compared to when they 
have to make similar judgements about themselves? I have no 
answer to this question. What may be instructive was indirect evi- 
dence that other-ratings maybe less subject to random answering 
than self-ratings. Besides determining the test-retest unreliability, 
this was a novel but indirect way in which to estimate the level of 
noise in personality ratings. Assuming that this method was com- 
prehensive, the results provided another clue to the riddle why the 
measurement of some personality traits such as Neuroticism and 
Conscientiousness are more vulnerable to the influence of noise, 
that is, random answering, while answers to Openness items are 
closest to an ideal binomial or mixed-binomial distribution. The 
strength of any new idea, besides other important qualities, is 
the number of new problems it could raise. It seems that the 
mixed-binomial model has fulfilled this promise, since there are a 
number of new intriguing questions begging for answers. 
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